Abstract

We present an approach to labeling short video clips with English verbs
as event descriptions. A key distinguishing aspect of this work is that it
labels videos with verbs that describe the spatiotemporal interaction
between event participants, humans and objects interacting with each
other, abstracting away all object-class information and fine-grained
image characteristics, and relying solely on the coarse-grained motion of
the event participants. We apply our approach to a large set of 22 distinct
verb classes and a corpus of 2,584 videos, yielding two surprising
outcomes. First, a classification accuracy of greater than 70% on a
1-out-of-22 labeling task and greater than 85% on a variety of 1-out-of-10
subsets of this labeling task is independent of the choice of which of two
different time-series classifiers we employ. Second, we achieve this level
of accuracy using a highly impoverished intermediate representation
consisting solely of the bounding boxes of one or two event participants
as a function of time. This indicates that successful event recognition
depends more on the choice of appropriate features that characterize the
linguistic invariants of the event classes than on the particular
classifier algorithms.

1 Introduction

People describe observed visual events using verbs. A common assumption
in Linguistics (Jackendoff 1983; Pinker 1989) is that verbs typically
characterize the interaction between event participants in terms of the gross
changing motion of these participants. Object class and 
image characteristics of the participants are believed to be 
largely irrelevant to determining the appropriate verb label 
for an event. Participants simply fill roles (such as agent
and patient) in the spatiotemporal structure of the event 
class described by a verb. For example, an event where 
one participant (the agent) picks up another participant (the 
patient) consists of a sequence of two subevents, where dur- 
ing the first subevent the agent moves towards the patient 
while the patient is at rest and during the second subevent 
the agent moves together with the patient away from the 
original location of the patient. It does not matter whether 
the agent is a human or a cat, or whether the patient is a 
ball or a cup. Moreover, the shapes, sizes, colors, textures, 
etc. of the participants are irrelevant. Additionally, only 
the gross motion characteristics are relevant; it is irrelevant 
whether the participants grow, shrink, bend, vibrate, etc. 
during a pick up event. The precise linear or angular veloc- 
ities and accelerations are likewise irrelevant. 

The objective of this paper is to evaluate this Linguistic as- 
sumption and its relevance to the computer-vision task of 
labeling video events with verbs. In order to evaluate this 
hypothesis, we focus our attention on methods that clas- 
sify events solely on the basis of the gross changing motion 
of the event participants. In doing so, we often expressly
discard other sources of information such as object class,
changing human body posture, and low-level image characteristics
such as shape, size, color, and texture. We do this
not because we believe that such information could not help 
event recognition but rather to allow us to strongly evaluate 
the above hypothesis. The surprising result of this endeavor 
is that gross changing motion of event participants attains 
greater than 70% accuracy on a 1-out-of-22 forced-choice
labeling task, significantly outperforming chance (4.5%), 
and greater than 85% accuracy on a variety of 1-out-of- 
10 subsets of this labeling task, again significantly outper- 
forming chance (10%). 

As this paper focuses on labeling video events with
verbs, neither the methods nor the datasets commonly used
in prior event-classification efforts are appropriate.
Such work typically classifies events using object and
image characteristics and fine-grained shape and motion
features, such as spatiotemporal volumes (Blank et al. 2005;
Laptev and Rozenfeld 2008; Rodriguez et al. 2008) and tracked
feature points (Liu et al. 2009; Schuldt et al. 2004;
Wang and Mori 2009). Moreover, many of
the datasets commonly used in such work do not involve
people interacting with objects or other people and contain
event classes that do not depict common verbs. For
example, the distinctions between wave1 and wave2 or
jump and pjump in the WEIZMANN dataset (Blank et al. 2005)
or the distinctions between Golf-Swing-Back,
Golf-Swing-Front, and Golf-Swing-Side;
Kicking-Front and Kicking-Side; or
Swing-Bench and Swing-SideAngle in the
SPORTS ACTIONS dataset (Rodriguez et al. 2008) do not
correspond to distinctions in verb semantics. The event
classes side and jack in the WEIZMANN dataset, the
event classes Swing-Bench and Swing-SideAngle
in the SPORTS ACTIONS dataset, and the vast majority
of the event classes in the UCF50 dataset (Liu et al. 2009)
(e.g. Basketball, Billiards,
BreastStroke, CleanAndJerk, HorseRace,
HulaHoop, MilitaryParade, TaiChi, or YoYo,
just to name a few) do not correspond to verbs in any
language. The videos in the KTH dataset (Schuldt et al. 2004)
do not reflect the true meanings of any verbs, let
alone boxing or clapping or waving one's hands.
Typical actions in specialized domains like ballet (cf. the
Ballet dataset (Wang and Mori 2009)) are described
by nouns, not verbs, and often are not part of common
lay vocabulary. The distinctions between the event classes
golf_swing, tennis_swing, and swing in the
YOUTUBE dataset (Liu et al. 2009) reflect distinctions in
event participants, not the semantics of the verb swing.



Siskind and Morris (1996) presented a technique for labeling
video events with verbs based on the changing motion
patterns of the event participants. However, they only ap- 
plied their technique to a small number of event classes 
(six) and a small corpus of thirty-six videos, six per class.
Moreover, they derived the changing motion patterns using
a rudimentary tracker that was specific to color and
motion using background subtraction. Thus the event par- 
ticipants were limited to people's hands interacting with 
colored blocks in uncluttered desktop environments with 
static backgrounds. In this paper, we employ the same tech- 
nique for labeling video events with verbs but extend it to 
a much larger number of event classes (twenty-two) and
evaluate it on a much larger corpus of 2,584 videos rang- 
ing from 6 to 584 per class. Since the corpus used in the 
present effort exhibits a wide variety of natural event par- 
ticipants in a wide variety of cluttered environments with 
nonstationary backgrounds, this paper employs novel and 
more general-purpose techniques for deriving the changing 
motion patterns. Moreover, Siskind & Morris used only 
one algorithmic method, namely hidden Markov models 
(HMMs), to classify the time series that characterize the 
changing motion patterns. Thus one might conclude that 
the performance of this approach is somehow dependent 
on this choice of classifier. In this paper, we employ two 
distinct time-series classification methods, namely HMMs 
and dynamic time warping (DTW), and demonstrate that
both achieve essentially identical performance. Thus it ap- 
pears that the strength of the approach results from the gen- 
eral principle of classifying events based on gross changing 
motion patterns, not on the algorithmic particulars. More- 
over, we demonstrate a surprising result. Our front-end 
tracker abstracts each video as one or two moving axis- 
aligned rectangles. Despite such an extremely impover- 
ished representation that passes only 4 or 8 small integers 
per frame between the front-end tracker and the back-end 
time-series classifier, and the fact that all training and clas- 
sification is performed solely on this impoverished repre- 
sentation, both of our classifiers attain greater than 70% 
accuracy on a 1-out-of-22 forced-choice labeling task and
greater than 85% accuracy on a variety of 1-out-of-10 subsets
of this task. This supports the common assumption in Lin- 
guistics that the meanings of many common verbs are sen- 
sitive only to gross changing motion patterns of the event 
participants and not the object class or image characteris- 
tics of those participants. 

The paper is organized as follows. Section 2 describes the
new corpus that we use for this effort. Section 3 describes
the tracking methods that we employ to abstract each video
in this corpus to one or two moving axis-aligned rectangles.
Section 4 describes the feature vectors that we extract
from this impoverished representation and the particulars
of the training and classification paradigms. Section 5
describes our experimental results. Section 6 concludes with
a discussion of potential improvements.

2 The Mind's Eye Corpus 

Figure 1: The number of exemplar videos for each verb in
the DARPA Mind's Eye C-D1a corpus. There are 2,584
videos and 22 verbs in total.

As part of the Mind's Eye program, DARPA has produced
a video corpus that is specifically designed to support
labeling of videos with common verbs. The particulars of
this corpus were driven by the desire to ground the semantics
of 48 specific English verbs. To date, several components
of this corpus have been released to program participants.
One portion, C-D1a, containing 2,584 videos, was
released in late September 2010, while a second portion,
C-D1b, containing 1,564 videos, was released in late January
2011. The videos are provided at 720p@30fps and range
from 21 frames to 1408 frames in length, with an average
of 241 frames. The videos in C-D1a range from 21 frames
to 809 frames in length, with an average of 141 frames.
Each video is intended to depict one of the 48 specific
English verbs and collectively all 48 verbs are represented in
this combined corpus (with unequal numbers of exemplar
videos). Each video comes labeled with the intended verb
depiction.¹ Because verbs often exhibit a range of polysemous
and homonymous meanings and also may exhibit
synonymy where the semantic space of one verb may include
all or part of the semantic space of another verb,
DARPA intends to eventually solicit human judgements for
the association of verb labels with each video. Since such
human labelings have not yet been produced, in this paper
we simply take the 'correct' label for each video to be the
intended verb label provided with the video. Moreover, this
paper considers only the C-D1a portion that depicts 22 specific
English verbs. Fig. 1 summarizes the distribution of
verbs and exemplar videos in this portion of the corpus.

Figure 2: The 26 distinct objects that play a role in the
depicted verbs in the C-D1a corpus. The starred objects
are the ones that are most difficult to detect and classify
reliably.

Conformant to the linguistic observation that object identity
and class are tangential to the task of labeling a video
with a verb, different exemplars for each of the verbs in
C-D1a often have the participant roles played by different
object instances and classes. The C-D1a corpus has a total
of 26 distinct objects that play a role in the depicted verbs
as enumerated in Fig. 2. (Note that there are far more distinct
objects that do not play a role in the depicted verbs
and serve solely to clutter the background.) Many of these
objects, however, only appear in the corpus occupying a
very small portion of the field of view and are difficult
for humans, let alone machines, to detect and classify reliably.
The ones that are most difficult to detect and classify
reliably are starred in Fig. 2. For each of the remaining
ones, we manually cropped a collection of between 1,500
and 2,100 exemplars (combining both positive and negative
samples) to train a part-based object detector
(Felzenszwalb et al. 2010). It is important to stress that we use
this object detector solely to produce bounding-box information
for deriving the gross changing motion patterns of
the event participants. During event classification, we
expressly discard the object-class information and confidence
scores provided by the object detector. In Section 6, we
discuss how one could extend our methods to make use of
such information and achieve even higher classification
accuracy.

3 Tracking 



¹ Our corpus-size measurements reflect only the videos in the
SINGLE_VERB directory of C-D1a, and eliminate from consideration
those videos not labeled with a single verb from this list of
48 verbs.



We use Felzenszwalb et al.'s (Felzenszwalb et al. 2010)
part-based object detector as a detection source to pro- 
duce axis-aligned rectangles (henceforth detection boxes 
or simply detections or boxes) as a function of time. How- 
ever, it is unreliable alone as a means for characterizing 
gross participant-object motion because it simultaneously 
exhibits a high false-positive rate and a high false-negative 
rate. Moreover, there is no single detection threshold that 
properly trades off the false-positive and false-negative 
rates in a class- or video-independent fashion. Addition- 
ally, the raw detection-confidence values produced by the 
detector, or even their rank ordering, cannot be used on iso- 
lated frames to select the desired detection. Moreover, the 
detector alone cannot distinguish between false positives 
and multiple objects of the same class at close positions in 
the field of view. Likewise, the detector alone does not pro- 
vide temporal-correspondence information in this situation. 
These problems are particularly exacerbated by occlusion, 
where objects enter and leave the field of view or pass in 
front of or behind other objects. In these circumstances, the 
detection confidence becomes an even less reliable measure
of the (partially occluded) presence or absence of an object. 
This is a particularly egregious limitation because verbs de- 
scribe interaction among participants and such interaction 
most frequently involves occlusion. 



3.1 Optimal selection of object tracks 

We address all of these issues with a novel technique that
produces coherent object tracks across a video from collections
of independent detections in each frame. It simultaneously
selects among multiple detections in all frames
of a video to find the combination of selections that leads to
a global optimum of a cost function that characterizes the
overall object-track coherence. While we employ this technique
using Felzenszwalb et al.'s part-based object detector
as a detection source, it can be more generally applied to
any alternate detection source that outputs boxes with
confidence scores. The only requirement is that the confidence
scores must provide a total ordering of the boxes. The
confidence scores need not be normalized or lie in a particular
interval. This lax requirement facilitates integrating boxes
produced by different detection sources into a single coherent
track, simply by providing a correspondence between
the confidence values produced by the different detection
sources and how they impact this total order. We avail
ourselves of this potential in Section 3.4 to provide resilience
in the face of appearance change due to nonrigid motion
and out-of-plane object rotation.

One can conceivably use an alternate detection source that 
does not rely on an object detector. For example, one might 
do some form of background subtraction or motion-based 
tracking to separate moving objects from the background 
or some form of bottom-up foreground/background seg- 
mentation or contour completion to segment salient ob- 
jects. Any method that could reliably place bounding boxes 
around event participants as a function of time would suf- 
fice for our purposes. The sole reason that we employ 
an object detector as a detection source is that bottom-up 
methods are currently not sufficiently reliable, while meth- 
ods based on background subtraction or motion detection 
fail to detect non-moving event participants (of which there 
are many in our corpus) and are unreliable in the presence 
of nonstationary backgrounds (such as occur frequently in 
our corpus). 

We apply our detection source independently for each 
frame and each model, biasing this detection source to yield 
few false negatives at the expense of yielding a prepon- 
derance of false positives, and use our tracker to filter out 
the false positives. When using Felzenszwalb et al.'s part- 
based object detector as a detection source, we do this by 
subtracting a fixed offset (which we take to be 1) from the 
learned detection threshold. The particular value of this
offset is unimportant so long as it yields a sufficiently low
false-negative rate, as our method reliably selects coherent
tracks despite an extremely high false-positive rate. The
only negative impact of choosing too high an offset is an
increase in run time.

Felzenszwalb et al.'s part-based object detector, by default,
incorporates non-maxima suppression to remove detections 
that overlap more than 50% with detections of higher con- 
fidence. This tends to foil the above process for biasing the 
detector towards few false negatives and many false posi- 
tives. To counter the effect of excessive non-maxima sup- 
pression, we raise the overlap threshold to 80%. This al- 
lows for much better object localization and reduces jitter 
considerably. 

We have found that no amount of the above bias process 
will completely eliminate false negatives. To provide for 
robust production of coherent object tracks that are neces- 
sary for successful event classification, we compensate for 
the remaining false negatives by projecting each detection 
box in each frame forward a fixed number of frames using 
the Kanade-Lucas-Tomasi (KLT) (Shi and Tomasi 1994;
Tomasi and Kanade 1991) feature tracker. We track the
KLT features that reside inside each detection box for one 
frame and compute a single velocity vector and divergence 
vector for that detection by computing the average velocity 
and divergence of the KLT features tracked for that box. 
We use the aggregate velocity and divergence vectors to 
project the detection box forward one frame and repeat this 
process. We limit this projection process to 5 frames as it is 
subject to drift, and we need it only to compensate for false 
negatives which are relatively rare as a result of the above 
bias process. We augment the collection of detections to 
include the forward-projected boxes, taking the confidence 
score of a forward-projected box to be that of the original 
detection that was forward projected. 

To select a coherent object track across multiple frames we 
construct a graph with one vertex for each detection in each 
frame and edges connecting all pairs of detections in adjacent
frames. The edges are weighted with a cost that inversely
measures coherence and we search for a path from
the first to last frames with minimal total edge weight using
a dynamic-programming algorithm (Viterbi 1971) that
finds a global optimum. This cost is formulated as a linear 
combination of two components, one being the detection 
confidence score and the other being consistency with op- 
tical flow. The latter is taken to be the Euclidean distance 
between the center of a detection box in a given frame and 
a projection of the center of the corresponding detection 
box from the previous frame forward using optical flow. 
This forward-projection process is analogous to the one 
performed to compensate for false negatives except that 
the average velocity vector is computed from dense opti- 
cal flow instead of tracked KLT features. 
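
The path search described above can be sketched as follows. This is only
an illustrative sketch under stated assumptions, not our implementation:
the function name select_track, the tuple layout (cx, cy, score), the
per-detection flow displacements, and the sign convention (negated
confidence added to a flow-distance penalty weighted by 0.1) are all
assumptions made for the example.

```python
import math

def select_track(detections, flows, flow_weight=0.1):
    """Pick one detection per frame by minimal total path cost.

    detections[t] is a non-empty list of (cx, cy, score) for frame t.
    flows[t][i] is the (dx, dy) displacement (assumed to come from dense
    optical flow) that projects detection i of frame t into frame t + 1.
    Returns the index of the chosen detection in each frame.
    """
    # best[t][j]: minimal cost of any path ending at detection j of frame t.
    best = [[-score for (_, _, score) in detections[0]]]
    back = [[None] * len(detections[0])]
    for t in range(1, len(detections)):
        row, ptr = [], []
        for (cx, cy, score) in detections[t]:
            candidates = []
            for i, (px, py, _) in enumerate(detections[t - 1]):
                dx, dy = flows[t - 1][i]
                # Distance between this center and the projected previous center.
                dist = math.hypot(cx - (px + dx), cy - (py + dy))
                candidates.append((best[t - 1][i] + flow_weight * dist, i))
            cost, i = min(candidates)
            row.append(cost - score)  # higher confidence lowers the cost
            ptr.append(i)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final vertex.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for t in range(len(detections) - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return path
```

With d detections per frame, each step examines d^2 edges, so the search
is linear in the number of frames and quadratic in d.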

In principle, one could use either KLT features or optical
flow for either forward-projection process. We find
that, in practice, KLT features yield better results for the 
forward-projection process used to compensate for false 
negatives while optical flow yields better results for the 
forward-projection process used to compute track coher- 
ence. Also, our track-coherence measure uses only the 
distance between detection-box centers and thus does not 
need a divergence measure. While one could extend the 
track-coherence measure to incorporate such information, 
we find that it yields no improvement in performance. 
In our experiments, we weight the optical-flow compo- 
nent of track coherence ten times less than the detection- 
confidence score. We bias the track-coherence measure to- 
wards detection confidence to prevent production of tracks 
that are consistent with optical flow but do not correspond 
to reliable object detections. Other than this general bias, 
we find that the object tracks produced are largely insensi- 
tive to the precise weighting value. 

3.2 Entering and leaving the field of view 

The algorithm described thus far constructs tracks that span 
the entire video from the first frame to the last frame. We 
allow for objects that enter and leave the field of view sim- 
ply by applying this algorithm to a subinterval of the video. 
The only difficulty in doing so is determining the subinter- 
val boundaries. We take the subinterval to begin at the first 
frame with a detection confidence above a certain thresh- 
old, and end at the last such frame. To derive this threshold, 
we compute a (50 bin) histogram of the maximal detection- 
confidence scores in each frame, over the entire video. One 
expects this histogram to be bimodal since frames in which 
the object is not present will have lower confidence scores, 
as all detections will be false positives. We take the thresh- 
old to be the minimum of the value that maximizes the 
between-class variance (Otsu 1979) when bipartitioning
this histogram and the learned detector-confidence thresh- 
old offset by a fixed, but small, amount (0.4). In practice, 
we find that proper selection of the subinterval is largely 
insensitive to the number of bins and the precise threshold 
offset. 
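
The subinterval selection above can be sketched as follows. This is a
minimal sketch under assumptions: the names otsu_threshold and
track_subinterval are hypothetical, the histogram spans the observed
score range, and the learned threshold is offset downward (the text does
not state the direction of the 0.4 offset).

```python
def otsu_threshold(values, bins=50):
    """Threshold maximizing between-class variance of a histogram of values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (k + 0.5) * width for k in range(bins)]
    total = len(values)
    sum_all = sum(c * x for c, x in zip(counts, centers))
    best_var, best_k = -1.0, 0
    w0, sum0 = 0, 0.0
    for k in range(bins - 1):
        w0 += counts[k]
        sum0 += counts[k] * centers[k]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance up to a constant
        if var > best_var:
            best_var, best_k = var, k
    return lo + (best_k + 1) * width  # upper edge of the last low-class bin

def track_subinterval(max_scores, learned_threshold, offset=0.4, bins=50):
    """First and last frame whose maximal confidence exceeds the threshold."""
    # Assumption: the offset is subtracted from the learned threshold.
    threshold = min(otsu_threshold(max_scores, bins), learned_threshold - offset)
    above = [t for t, s in enumerate(max_scores) if s > threshold]
    return (above[0], above[-1]) if above else None
```

Here max_scores holds the maximal detection confidence in each frame of
the video.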

3.3 Multiple instances of the same object class 

We detect multiple tracks of the same object class by re- 
peated application of the above method. In doing so, 
we must prevent subsequent iterations from rediscovering 
tracks produced by earlier iterations. The naive way of do- 
ing this would be to remove detections associated with ear- 
lier tracks. Detection boxes can be deemed to be associated 
with earlier tracks when their centers lie inside detection 
boxes included in those earlier tracks. However, removing 
all such detections runs the risk of precluding overlapping 
tracks, as would happen when objects pass each other in 
the field of view. So instead of removing detections, we 
rescore them with the maximal detection score in the lower
quartile of scores for that frame. Given the biasing process
towards false positives and away from false negatives in
the detection source, boxes in the lower quartile are likely 
to be false positives and undesirable to include in a coher- 
ent track. Rescoring detections in this fashion biases subse- 
quent iterations to find distinct tracks while allowing tracks 
to briefly overlap. 
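
The rescoring step can be sketched as follows. This is an illustrative
sketch only: the name rescore_frame and the box layout (x1, y1, x2, y2,
score) are assumptions, as is the reading of "lower quartile" as the
lowest quarter of the frame's scores by count.

```python
def rescore_frame(detections, earlier_box):
    """Demote detections whose center lies inside an earlier track's box.

    detections: list of (x1, y1, x2, y2, score) for one frame.
    earlier_box: (x1, y1, x2, y2) of an earlier track in the same frame.
    """
    if not detections:
        return []
    scores = sorted(d[4] for d in detections)
    # Assumption: the lower quartile is the lowest quarter of scores by count.
    quartile_size = max(1, len(scores) // 4)
    low_score = scores[quartile_size - 1]  # maximal score in that quartile
    x1, y1, x2, y2 = earlier_box
    rescored = []
    for (bx1, by1, bx2, by2, score) in detections:
        cx, cy = (bx1 + bx2) / 2, (by1 + by2) / 2
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            score = low_score  # keep the box, but make it unattractive
        rescored.append((bx1, by1, bx2, by2, score))
    return rescored
```

Because the boxes are kept rather than removed, a later track can still
pass through them when the other cost terms favor doing so.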

If one is not careful, there can be crossover at such points of 
overlap, where the object identity is swapped between two 
distinct tracks. We use an object-appearance model to bias 
against such crossover. Color histograms are computed in
the CIELAB (CIE 1978) color space of the pixel values
inside the detection boxes after shrinking those boxes
by 60% to ameliorate the influence of background pixels 
on these histograms. We then augment the edge-weight 
function to include a coherence measure on object appear- 
ance, taking this coherence measure to be Earth Mover's 
distance (Peleg et al. 1989) between the corresponding histograms.
We weight object appearance and detector confidence
equally in the coherence measure, though in practice,
we find that the object tracks produced are largely insensi- 
tive to the precise weighting. 

3.4 Nonrigid motion and out-of-plane rotation 

Felzenszwalb et al.'s part-based object detector is unreli- 
able as a detection source when there is nonrigid motion 
and out-of-plane rotation. Our tracking framework can 
provide resilience in the face of such unreliability by in- 
tegrating detection boxes from multiple detection sources. 
We do so by training multiple models for Felzenszwalb et 
al.'s part-based object detector for varying object appearance
under nonrigid motion and out-of-plane rotation and
taking the union of the resulting detections. As discussed in
Section 3.1, we must ensure that the confidence scores allow
for comparison between detections produced by different detection
sources. We do this by offsetting the confidence scores for 
each detection source by the threshold computed in Section 3.2.



The C-D1a corpus has little out-of-plane rotation, which
therefore does not impact the reliability of the detection
source. But the corpus does contain one source of nonrigid
motion, namely changing human body posture. For this 
corpus, it is sufficient to train detectors for three distinct 
postures: standing, crouching, and lying down. 

Integrating multiple detection sources into a single object 
track allows annotation of the detections in that track with 
their source. In particular, this allows temporal annota- 
tion of human motion tracks with their changing posture. 
Conceivably one could use such information to support se- 
lection of the appropriate verb label. Because we wish to 
evaluate the hypothesis that verbs typically characterize the 
gross changing motion of the event participants, we expressly
discard such information in the experiments performed
in this paper.

3.5 Smoothing 

Boxes comprising the recovered object tracks suffer from 
jitter. We remove this jitter by fitting piecewise cubic 
splines to the widths, heights, and x and y center coordi- 
nates of the tracked boxes. A simple selection of smoothing 
parameters suffices for the C-D1a corpus. Since the videos
in C-D1a have low frame-length variance, a constant number
of spline pieces is adequate. Box x and y center coordinates
are smoothed with 10 pieces, as they can move
significantly when tracking accelerating objects, for exam- 
ple a bouncing ball. Box widths and heights are smoothed 
with 5 pieces as object shape and size change less drasti- 
cally. 

3.6 Results 

Our tracker runs in time O(lm + lmn(df)^2) to recover n
tracks with m detection sources, each yielding d detections
per frame, doing f frames of forward projection, on
videos of length l. In practice, the run time is dominated
by the detection process and the dynamic-programming
step. Fig. 3 illustrates the operation of our tracker, rendering
the output of each stage. From this video, one can
clearly see the robustness of our tracker in light of clut- 
tered nonstationary backgrounds, motion that is not per- 
pendicular to the camera axis, an extremely high false- 
positive biased detection rate of the detection source, oc- 
clusion that results from overlapping tracks corresponding 
to interacting objects, nonrigid motion that results from 
changing human body posture, objects entering and leav- 
ing the field of view, and multiple instances of the same 
object class. Moreover, as illustrated in Fig. 4, the fact that
our tracker finds an optimal coherent track by processing 
the entire video allows it to robustly track objects that ap- 
proach or recede from the camera by a large distance that 
would otherwise be too small in the field of view to reli- 
ably track by methods that did not process the entire video. 
Without the false-positive bias that such a whole-video ap- 
proach allows, Felzenszwalb et al.'s part-based object de- 
tector would not even detect such objects. 

4 Classification 

We convert the collection of object tracks for a video to a 
time-series of real-valued feature vectors and formulate the 
problem of labeling a video with a verb as a time-series 
classification problem. In doing so, we discard all object 
identity and body posture information that is available in 
those tracks. 

For each video, we designate one track as the agent and 
another track (if present) as the patient. The agent is determined
using a heuristic: people are more likely to be agents
than inanimate objects are, and bicycles, motorcycles, and
SUVs are more likely to be agents than other inanimate ob- 
jects because they are driven by people that we might fail 
to detect due to occlusion. Another track (if present) is se- 
lected as the patient using the same heuristic. Ties are bro- 
ken by selecting the track with highest track coherence as 
the agent and the one with second highest track coherence 
as the patient. 
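
The role-assignment heuristic can be sketched as a sort over tracks. This
is a minimal sketch under assumptions: the class-name strings, the
numeric priorities, and the (object_class, coherence) pair layout are
hypothetical, and a larger coherence value is taken to mean a more
coherent track.

```python
# Hypothetical class names; lower priority value means more agent-like.
AGENT_PRIORITY = {"person": 0, "bicycle": 1, "motorcycle": 1, "suv": 1}

def assign_roles(tracks):
    """Return (agent, patient) from a list of (object_class, coherence).

    Tracks are ranked by class priority, with ties broken by higher
    track coherence. Either role is None when no track is available.
    """
    ranked = sorted(tracks, key=lambda t: (AGENT_PRIORITY.get(t[0], 2), -t[1]))
    agent = ranked[0] if ranked else None
    patient = ranked[1] if len(ranked) > 1 else None
    return agent, patient
```

The class labels are consulted only to choose roles here. They are not
passed on to the classifier.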

For all videos, we extract a feature vector for each frame 
representing the gross absolute motion of the agent: 

1. x-coordinate of the box center 

2. y-coordinate of the box center 

3. box aspect ratio 

4. derivative of the box aspect ratio 

5. magnitude of the velocity of the box center 

6. direction of the velocity of the box center 

7. magnitude of the acceleration of the box center 

8. direction of the acceleration of the box center 

For videos with two or more object tracks, we also extract a 
feature vector that includes the above absolute motion fea- 
tures representing the independent motion of each of the 
agent and patient along with additional features that de- 
scribe their relative motion: 

1. distance between agent and patient box centers 

2. orientation of vector from agent box center to patient 
box center 

3. derivative of the distance between agent and patient 
box centers 

In all of the above, temporal derivatives and corresponding 
velocities and accelerations are computed as a two-point fi- 
nite difference. Note that we label videos with verbs using 
the gross changing motion patterns of at most two event 
participants. While we could, in principle, label videos 
on the basis of the motion patterns of more event partici- 
pants, if present, by straightforward extension of the above 
feature-vector computation to include absolute features for 
all objects and relative features for all object pairs, we ex- 
pressly refrain from doing so to evaluate the Linguistic hy- 
pothesis that verbs largely describe the interaction between 
an agent and a patient. 
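
The feature computation above can be sketched as follows. This is an
illustrative sketch under assumptions: boxes are given as (cx, cy, w, h)
per frame, the aspect ratio is taken as width over height, the two-point
finite difference is a backward difference with a zero first entry, and
all function names are hypothetical.

```python
import math

def diff(seq):
    """Two-point backward difference; the first entry is taken to be 0."""
    if not seq:
        return []
    return [0.0] + [b - a for a, b in zip(seq, seq[1:])]

def absolute_features(track):
    """Eight absolute-motion features per frame for one (cx, cy, w, h) track."""
    cx = [b[0] for b in track]
    cy = [b[1] for b in track]
    aspect = [b[2] / b[3] for b in track]  # assumption: width / height
    d_aspect = diff(aspect)
    vx, vy = diff(cx), diff(cy)
    ax, ay = diff(vx), diff(vy)
    return [
        [cx[t], cy[t], aspect[t], d_aspect[t],
         math.hypot(vx[t], vy[t]), math.atan2(vy[t], vx[t]),
         math.hypot(ax[t], ay[t]), math.atan2(ay[t], ax[t])]
        for t in range(len(track))
    ]

def relative_features(agent, patient):
    """Distance, orientation, and distance derivative between box centers."""
    dist = [math.hypot(p[0] - a[0], p[1] - a[1]) for a, p in zip(agent, patient)]
    orient = [math.atan2(p[1] - a[1], p[0] - a[0]) for a, p in zip(agent, patient)]
    d_dist = diff(dist)
    return [[dist[t], orient[t], d_dist[t]] for t in range(len(dist))]

def pair_features(agent, patient):
    """Agent features, patient features, then relative features, per frame."""
    return [a + p + r for a, p, r in zip(absolute_features(agent),
                                         absolute_features(patient),
                                         relative_features(agent, patient))]
```

Under these assumptions a single-track video yields 8 features per frame
and a two-track video yields 19.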

The verbs in C-D1a often have different senses, such as the
causative/inchoative alternation (the agent bounces vs. the 
agent bounces the patient), that involve a different number 
of participants. In this case, we train two distinct classi- 
fiers, one on all videos characterizing the motion of just the 
agent and one on those videos that have both an agent and 
a patient characterizing the motion of both the agent and 
the patient. When classifying an unseen video with just 
a single object track we use models trained on just agents, 
while when classifying an unseen video with more than one 
object track we use models trained on both agents and pa- 
tients. 
Figure 3: The output of each stage of our tracker on a single frame. (a) Detections for the person model in red and the
motorcycle model in green. (b) Forward projections of the detections from the 5 previous frames. (c) The object tracks
with maximal coherence selected by our dynamic-programming algorithm. Two distinct person tracks are shown in red
and blue, and a motorcycle track is shown in green. (d) The smoothed tracks.




Figure 4: Four frames tracking two people and one motorcycle. Two separate person tracks in red and blue, and the 
motorcycle track in green. The tracker is robust despite the fact that one person occludes most of the motorcycle, the tracks 
of the two people overlap, and all three objects become very small as they recede from the camera. 






To evaluate the hypothesis that it is possible to classify 
events solely on the basis of the gross changing motion 
of the event participants and demonstrate the insensitiv- 
ity of this hypothesis to the choice of time-series classi- 
fier, we have run two parallel sets of experiments, one with 
HMMs (Baum and Petrie 1966) and one with DTW (Ellis
2003; Sakoe and Chiba 1978). When using HMMs,
we train models with 5 states and independent continuous 
output distributions for each feature. We use Gaussian dis- 
tributions for those features that constitute linear quantities 
and Von Mises distributions for those features that consti- 
tute angular quantities. We found that increasing the num- 
ber of states beyond 5 did not significantly improve accu- 
racy. When using DTW, we employ Euclidean distance 
between feature vectors as the distance metric between 
frames and use DTW to extend this metric as a distance 
between frame sequences to construct a nearest-neighbor 
classifier between unseen videos and training exemplars. 
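
The DTW-based nearest-neighbor classifier described above can be sketched as follows. This is a minimal illustration rather than the code used in our experiments: the function names are hypothetical, and the step pattern (unit steps in either or both sequences, with no warping window and no path-length normalization) is an assumption, since the text fixes only the Euclidean per-frame metric.

```python
import math

def frame_distance(u, v):
    """Euclidean distance between two per-frame feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(seq_a, seq_b):
    """Dynamic-time-warping cost between two non-empty frame sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the minimal cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]

def classify_nearest_neighbor(query, exemplars):
    """Return the label of the training exemplar closest to query under DTW.

    exemplars: list of (label, sequence) pairs.
    """
    best_label, _ = min(
        exemplars, key=lambda pair: dtw_distance(query, pair[1])
    )
    return best_label
```

Each call to `dtw_distance` takes time proportional to the product of the two sequence lengths, so classifying one video costs one such alignment per training exemplar.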

5 Results 

We performed 5-fold cross-validation on the entire C-D1a
corpus with a 1-out-of-22 forced-choice classification task
using both HMMs and DTW. To do this, we indepen- 
dently partitioned the set of 'correct' exemplars for each 
verb into five random but equally sized components (up to 
quantization). For each of the five cross-validation runs 
we trained on the exemplars in four of the five partitions 
and tested on the exemplars in the remaining partition. 
Fig. 5 gives the recognition accuracy for each classifica-
tion algorithm for each cross-validation run. Fig. 7 and
Fig. 8 give the aggregate confusion matrices for each clas-
sification algorithm across all five cross-validation runs.
Note the essentially identical performance of HMMs and
DTW: HMMs exhibit an aggregate classification accuracy
of 71.9% while DTW exhibits an aggregate classification 
accuracy of 71.3%. Moreover, we attain greater than 85% 
aggregate classification accuracy for three different 1-out- 
of-10 subsets of this forced-choice classification task with 
both HMMs and DTW: arrive bounce dig drop exchange 
give jump kick pickup run (87.4% HMMs, 85.3% DTW), 
bounce dig drop exchange give jump kick pickup pull run 
(87.5% HMMs, 85.1% DTW), and bounce dig drop ex- 
change give jump kick pass pickup pull (86.1% HMMs, 
87.0% DTW). These results support the hypothesis that 
classification accuracy depends more on the correct choice 
of features than on the classification algorithm. 
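
The per-verb partitioning used for cross-validation can be sketched as below. The function names, the `(verb, video_id)` representation of an exemplar, the fixed random seed, and the round-robin dealing used to keep the per-verb component sizes within one of each other are assumptions made for illustration.

```python
import random
from collections import defaultdict

def partition_per_verb(exemplars, k=5, seed=0):
    """Split (verb, video_id) pairs into k folds, balancing each verb separately."""
    by_verb = defaultdict(list)
    for verb, video_id in exemplars:
        by_verb[verb].append(video_id)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for verb in sorted(by_verb):
        videos = by_verb[verb]
        rng.shuffle(videos)
        # Dealing the shuffled videos round-robin gives each fold an equal
        # share of this verb, up to quantization.
        for i, video_id in enumerate(videos):
            folds[i % k].append((verb, video_id))
    return folds

def cross_validation_splits(folds):
    """Yield (train, test) lists, holding out one fold at a time."""
    for held_out in range(len(folds)):
        train = [
            ex for i, fold in enumerate(folds) if i != held_out for ex in fold
        ]
        yield train, folds[held_out]
```

As a consistency check on the reported numbers, the unweighted means of the per-run accuracies in Fig. 5 are 71.94% for HMMs and 71.32% for DTW, in line with the aggregate accuracies of 71.9% and 71.3%.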

6 Conclusion 

Our focus in this paper is to evaluate the hypothesis that
it is possible to label videos with verbs using information
solely about the gross changing motion of the event par-
ticipants. There are numerous places where our computa-
tional methods expressly discard information that is oth-
erwise available in order to evaluate this hypothesis. Since
such information might correlate with the underlying event,
one could extend our classifiers to make use of such infor-
mation. For example, one might expect that detector confi-
dence scores would decrease with occlusion and thus cor-
relate with the object interaction indicative of event class.
Similarly, one might expect that object class would corre-
late with event class. Indeed, as shown in Fig. 6, such cor-
relation significantly reduces the potential verb-label space,
rendering the verb-labeling task almost trivial. Likewise,
as discussed in section 3.4, one could augment the time
series of feature vectors with human body-posture infor-
mation that is extracted as a by-product of using multiple
detection sources to provide resilience in the face of out-of-
plane rotation and nonrigid motion. It is quite unexpected
that we attain results this good despite expressly
discarding such information. This supports the common
assumption in Linguistics that verbs typically characterize
the interaction between event participants in terms of the
gross changing motion of these participants.

        Run 1   Run 2   Run 3   Run 4   Run 5
HMMs     74.4    72.9    70.5    69.5    72.4
DTW      70.7    71.2    69.5    75.0    70.2

Figure 5: Accuracy (%) for HMMs and DTW on the 1-out-of-
22 action classification task for each of the 5 random parti-
tions of the corpus.

bag         → lift
bicycle     → give
big ball    → approach | chase | catch | collide
bucket      → dig
chair       → give | collide | fall
football    → catch | throw
motorbike   → give | approach | chase | leave | run
rake        → dig
shovel      → dig
small ball  → collide | lift
SUV         → give | approach | chase | leave | catch | throw | run
wooden box  → give

Figure 6: Correlation between object and event class in C-D1a.
Figure 7: The aggregate confusion matrices for 5-fold cross validation on the 1-out-of-22 classification task using HMMs. The overall accuracy is 71.9%.

Figure 8: The aggregate confusion matrices for 5-fold cross validation on the 1-out-of-22 classification task using DTW. The overall accuracy is 71.3%.



